
Conversation

@james-rms (Contributor) commented Dec 3, 2025

Which issue does this PR close?

Rationale for this change

Today, users who attempt to copy an object larger than 5GB in S3 using object_store will see this error:

Server returned non-2xx status code: 400 Bad Request: 
<Error><Code>InvalidRequest</Code><Message>
The specified copy source is larger than the maximum allowable size for a copy source: 5368709120
</Message></Error>

Per AWS's docs, the way to get around this limit is to perform the copy in several parts using multipart copies. This PR adds that functionality to the AWS client.
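
To make the sequence concrete: AWS's multipart copy boils down to CreateMultipartUpload, one UploadPartCopy per byte range, and CompleteMultipartUpload. The sketch below shows that sequence using the official aws-sdk-s3 crate purely for reference (assuming a recent SDK release); it is not this PR's code, which issues the same requests through object_store's own S3 client.

    use aws_sdk_s3::types::{CompletedMultipartUpload, CompletedPart};

    /// Illustration only: copy `src_key` (of `size` bytes) within `bucket` to
    /// `dst_key` using UploadPartCopy with `part_size`-byte ranges.
    async fn multipart_copy(
        client: &aws_sdk_s3::Client,
        bucket: &str,
        src_key: &str,
        dst_key: &str,
        size: u64,
        part_size: u64,
    ) -> Result<(), Box<dyn std::error::Error>> {
        let upload = client
            .create_multipart_upload()
            .bucket(bucket)
            .key(dst_key)
            .send()
            .await?;
        let upload_id = upload.upload_id().ok_or("missing upload id")?.to_string();

        let mut parts = Vec::new();
        let (mut start, mut part_number) = (0u64, 1i32);
        while start < size {
            // Inclusive byte range for this part, capped at the object size.
            let end = (start + part_size).min(size) - 1;
            let copied = client
                .upload_part_copy()
                .bucket(bucket)
                .key(dst_key)
                .upload_id(upload_id.clone())
                .part_number(part_number)
                // Real code should percent-encode the copy source path.
                .copy_source(format!("{bucket}/{src_key}"))
                .copy_source_range(format!("bytes={start}-{end}"))
                .send()
                .await?;
            let e_tag = copied
                .copy_part_result()
                .and_then(|r| r.e_tag())
                .ok_or("missing part ETag")?
                .to_string();
            parts.push(
                CompletedPart::builder()
                    .part_number(part_number)
                    .e_tag(e_tag)
                    .build(),
            );
            start = end + 1;
            part_number += 1;
        }

        client
            .complete_multipart_upload()
            .bucket(bucket)
            .key(dst_key)
            .upload_id(upload_id)
            .multipart_upload(
                CompletedMultipartUpload::builder()
                    .set_parts(Some(parts))
                    .build(),
            )
            .send()
            .await?;
        Ok(())
    }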

It adds two additional configuration parameters:

    /// The size threshold above which copy uses multipart copies under the hood. Defaults to 5GB.
    multipart_copy_threshold: u64
    /// When using multipart copies, the part size used. Defaults to 5GB.
    multipart_copy_part_size: u64

The defaults are chosen to minimise surprise: since users are accustomed to copies not requiring several requests, we don't switch to multipart copies until it's absolutely necessary, and when we do, we use as few parts as possible.
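
As a rough illustration of that part-selection logic (not the PR's actual code; the function name and rounding choice are mine), the part layout can be derived from the object size and multipart_copy_part_size like this:

    /// Illustrative only: split `object_size` bytes into the fewest parts of at
    /// most `part_size` bytes, as (start, length) pairs suitable for building
    /// UploadPartCopy `x-amz-copy-source-range` headers.
    fn copy_part_ranges(object_size: u64, part_size: u64) -> Vec<(u64, u64)> {
        assert!(part_size > 0, "part size must be non-zero");
        // div_ceil sidesteps the overflow that the usual
        // `(object_size + part_size - 1) / part_size` trick can hit near u64::MAX.
        let num_parts = object_size.div_ceil(part_size);
        (0..num_parts)
            .map(|i| {
                let start = i * part_size;
                (start, part_size.min(object_size - start))
            })
            .collect()
    }

    fn main() {
        const GIB: u64 = 1024 * 1024 * 1024;
        // A 12 GiB object with 5 GiB parts copies as three parts: 5, 5 and 2 GiB.
        for (start, len) in copy_part_ranges(12 * GIB, 5 * GIB) {
            println!("bytes={}-{}", start, start + len - 1);
        }
    }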

What changes are included in this PR?

See above.

Are there any user-facing changes?

Yes - these configuration parameters should be covered by the docstring changes.

@james-rms force-pushed the jrms/aws-multipart-copy branch 3 times, most recently from 4719ef4 to 09cef9b on December 4, 2025 12:11
@james-rms force-pushed the jrms/aws-multipart-copy branch from 09cef9b to acc8cc4 on December 4, 2025 12:14
@james-rms marked this pull request as ready for review on December 4, 2025 12:26
@tustvold (Contributor) commented Dec 4, 2025

I think this probably warrants a higher-level ticket to discuss how we should support this. As a start it would be good to understand how other stores, i.e. GCS and Azure, handle this, so that we can develop an abstraction that makes sense.

In particular I wonder if adding this functionality would make more sense as part of the multipart upload functionality? This of course depends on what other stores support.

In general, filing an issue first to get consensus on an approach is a good idea before jumping into an implementation.

@james-rms (Contributor, Author)

Great, created #563.

@james-rms (Contributor, Author)

@tustvold updated to pass CI, plus a small tweak to avoid overflow.

@alamb (Contributor) left a comment:

Thanks @james-rms and @tustvold -- the high-level idea seems reasonable to me, but I think this code needs tests (maybe unit tests?); otherwise we may inadvertently break the functionality in some future refactor.

@james-rms force-pushed the jrms/aws-multipart-copy branch from a8d535c to 1fdf7d6 on January 6, 2026 03:16
@james-rms (Contributor, Author)

@tustvold I've refactored slightly for unit tests and added a couple of integration tests as well. Please take another look when you have some time.

src/aws/mod.rs (outdated diff):
mode,
extensions: _,
} = options;
// Determine source size to decide between single CopyObject and multipart copy
Reviewer (Contributor) commented:

Is there some way we can avoid this, e.g. try CopyObject normally and fall back to multipart on error? Otherwise this adds an additional S3 round trip to every copy request.

.unwrap();

let mut payload = BytesMut::zeroed(10 * 1024 * 1024);
rand::fill(&mut payload[..]);
Reviewer (Contributor): 👍

@tustvold (Contributor) left a comment:

Looks good and well tested. Sorry for the delay in reviewing; I've been absolutely swamped.

I think we need to find a way to avoid regressing the common case of files smaller than 5GB, e.g. by first attempting CopyObject and then falling back if it errors (I am presuming S3 gives a sensible error here).

@james-rms (Contributor, Author) commented Jan 14, 2026

> I think we need to find a way to avoid regressing the common case of files smaller than 5GB, e.g. by first attempting CopyObject and then falling back if it errors (I am presuming S3 gives a sensible error here).

This is what we get back from S3:

<Error>
  <Code>InvalidRequest</Code>
  <Message>The specified copy source is larger than the maximum allowable size for a copy source: 5368709120</Message>
  <RequestId>...</RequestId>
  <HostId>...</HostId>
</Error>

As explained by AWS:

This error might occur for the following reasons:

An unpaginated ListBuckets request is made from an account that has an approved general purpose bucket quota higher than 10,000. You must make paginated requests to list the buckets in an account with more than 10,000 buckets.

The request is using the wrong signature version. Use AWS4-HMAC-SHA256 (Signature Version 4).

An access point can be created only for an existing bucket.

The access point is not in a state where it can be deleted.

An access point can be listed only for an existing bucket.

The next token is not valid.

At least one action must be specified in a lifecycle rule.

At least one lifecycle rule must be specified.

The number of lifecycle rules must not exceed the allowed limit of 1000 rules.

The range for the MaxResults parameter is not valid.

SOAP requests must be made over an HTTPS connection.

Amazon S3 Transfer Acceleration is not supported for buckets with non-DNS compliant names.

Amazon S3 Transfer Acceleration is not supported for buckets with periods (.) in their names.

The Amazon S3 Transfer Acceleration endpoint supports only virtual style requests.

Amazon S3 Transfer Acceleration is not configured on this bucket.

Amazon S3 Transfer Acceleration is disabled on this bucket.

Amazon S3 Transfer Acceleration is not supported on this bucket. For assistance, contact [Support](https://aws.amazon.com/contact-us/).

Amazon S3 Transfer Acceleration cannot be enabled on this bucket. For assistance, contact [Support](https://aws.amazon.com/contact-us/).

Conflicting values provided in HTTP headers and query parameters.

Conflicting values provided in HTTP headers and POST form fields.

CopyObject request made on objects larger than 5GB in size.

So the real question is: if we make a CopyObject call, can we assume that any InvalidRequest that comes back means the object was >5GB in size? Given the documentation I think that's OK, but I'm not really confident it will remain OK going forward. I'll push a commit that does this and let you decide on the approach.
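
For what it's worth, a minimal sketch of that check might look like the following; the S3Error shape and function name are hypothetical, not object_store's actual types. Matching on the message text as well as the code keeps other InvalidRequest causes from silently triggering a multipart retry.

    /// Hypothetical error shape for illustration: HTTP status plus the parsed
    /// <Code> and <Message> fields from the S3 error body.
    struct S3Error {
        status: u16,
        code: String,
        message: String,
    }

    /// Only the specific "copy source too large" failure quoted above should
    /// trigger a retry as a multipart copy; anything else surfaces as-is.
    fn should_retry_as_multipart_copy(err: &S3Error) -> bool {
        err.status == 400
            && err.code == "InvalidRequest"
            && err
                .message
                .contains("larger than the maximum allowable size for a copy source")
    }

    fn main() {
        let err = S3Error {
            status: 400,
            code: "InvalidRequest".to_string(),
            message: "The specified copy source is larger than the maximum \
                      allowable size for a copy source: 5368709120"
                .to_string(),
        };
        assert!(should_retry_as_multipart_copy(&err));
    }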

@tustvold (Contributor)

I guess another option would be to make this disabled by default and therefore opt-in... How do other clients handle this?

@james-rms (Contributor, Author) commented Jan 15, 2026

The Go S3 SDK leaves this up to the caller to figure out: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/s3#Client.CopyObject

The Go Cloud SDK appears not to support copies >5GB, because copy calls are forwarded directly to the CopyObject API and it doesn't expose a multipart copy method: https://github.com/google/go-cloud/blob/a52bb6614a70209265758ad7a795a4a3931fbe0b/blob/s3blob/s3blob.go#L856

@james-rms (Contributor, Author)

@tustvold I've implemented this as disabled by default (>5GB copies will just fail) and opt-in. By default there is no pre-copy HEAD request.
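
To illustrate what that dispatch implies (the names are hypothetical, not the PR's types): with the feature left at its default there is no HEAD request and no splitting, and only an opted-in threshold plus a known source size selects the multipart path.

    /// Hypothetical sketch of the opt-in behaviour described above.
    enum CopyPlan {
        /// One CopyObject request (the default; >5GB sources will fail).
        Single,
        /// CreateMultipartUpload + UploadPartCopy parts + CompleteMultipartUpload.
        Multipart { part_size: u64 },
    }

    /// With no threshold configured we never issue a pre-copy HEAD and never
    /// split; when configured, sources over the threshold use multipart copy.
    fn plan_copy(threshold: Option<u64>, source_size: Option<u64>, part_size: u64) -> CopyPlan {
        match (threshold, source_size) {
            (Some(t), Some(size)) if size > t => CopyPlan::Multipart { part_size },
            _ => CopyPlan::Single,
        }
    }

    fn main() {
        const GIB: u64 = 1024 * 1024 * 1024;
        // Opted in with a 12 GiB source: multipart copy in 5 GiB parts.
        assert!(matches!(
            plan_copy(Some(5 * GIB), Some(12 * GIB), 5 * GIB),
            CopyPlan::Multipart { .. }
        ));
        // Default configuration: a plain CopyObject with no HEAD beforehand.
        assert!(matches!(plan_copy(None, None, 5 * GIB), CopyPlan::Single));
    }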


Development

Successfully merging this pull request may close these issues:

Enable AWS client to copy objects >5GB in size